BERTScore: Evaluating Text Generation with BERT

https://arxiv.org/abs/1904.09675

Abstract

BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence.

Instead of exact matches, we compute token similarity using contextual embeddings.

BERTScore correlates better with human judgments and provides stronger model selection performance than existing metrics.

"Correlates with human judgments"

Finally, we use an adversarial paraphrase detection task to show that BERTScore is more robust to challenging examples when compared to existing metrics.

"Robust"

(What I want to know) The bert-score implementation reports precision, recall, and F1, but how does it compute them?

Figure 1 (an example of computing recall)

Reference: the weather is cold today

Candidate: it is freezing today

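Each token is converted into an embedding

The reference yields 5 embeddings

The candidate yields 4 embeddings

Cosine similarity is computed between one reference embedding and one candidate embedding

5 × 4 gives 20 values

Since this is recall, the greedy matching is done from the reference side

For each of the 5 reference tokens, take the maximum cosine similarity

Optional idf importance weighting can also be applied (as a multiplication)

Recall = (sum over reference tokens of: maximum cosine similarity * the token's own idf) / (sum of the reference tokens' idf)

As I understand it, precision is the reverse of this (computed over the 4 candidate tokens)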
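
The steps above can be sketched in plain Python. This is only a minimal illustration under assumptions, not the bert-score implementation: the embeddings are random stand-ins for real BERT outputs, the 8-dimensional size is arbitrary, and the function names (`cosine`, `similarity_matrix`, `greedy_match_scores`) are hypothetical. With no idf weights given, every token gets weight 1.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(ref_emb, cand_emb):
    """Rows are reference tokens, columns are candidate tokens."""
    return [[cosine(r, c) for c in cand_emb] for r in ref_emb]


def greedy_match_scores(ref_emb, cand_emb, ref_idf=None, cand_idf=None):
    """Greedy-matching precision, recall, and F1 with optional idf weights."""
    sim = similarity_matrix(ref_emb, cand_emb)
    if ref_idf is None:
        ref_idf = [1.0] * len(ref_emb)
    if cand_idf is None:
        cand_idf = [1.0] * len(cand_emb)

    # Recall: each reference token takes its best-matching candidate token.
    ref_best = [max(row) for row in sim]
    recall = sum(s * w for s, w in zip(ref_best, ref_idf)) / sum(ref_idf)

    # Precision: each candidate token takes its best-matching reference token.
    cand_best = [max(col) for col in zip(*sim)]
    precision = sum(s * w for s, w in zip(cand_best, cand_idf)) / sum(cand_idf)

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Stand-in embeddings (assumption): 5 reference tokens, 4 candidate tokens.
rng = random.Random(0)
ref_emb = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
cand_emb = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]

sim = similarity_matrix(ref_emb, cand_emb)  # 5 x 4 = 20 values
precision, recall, f1 = greedy_match_scores(ref_emb, cand_emb)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Taking the maximum along rows gives recall and along columns gives precision, so both come from the same 20 similarity values.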

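"2 Problem statement and prior metrics" is still on my to-read pile

The conventional BLEU score cannot evaluate correctly! Enter BERTScore, an evaluation metric close to human judgment and well suited to natural language!

When computing the similarity between the reference and candidate sentences, each token is matched with the token that gives the maximum cosine similarity, and the scores are computed from those matches (Maximum Similarity). These matches define three scores: precision, recall, and the F1 score.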
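
For reference, the three scores without idf weighting can be written as below. This follows the paper's notation as I understand it: $x$ is the tokenized reference, $\hat{x}$ the tokenized candidate, and the embeddings $\mathbf{x}_i$, $\hat{\mathbf{x}}_j$ are assumed to be normalized to unit length, so the inner product equals the cosine similarity.

```latex
R_{\mathrm{BERT}} = \frac{1}{|x|} \sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j
\qquad
P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|} \sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j
\qquad
F_{\mathrm{BERT}} = 2 \, \frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}
```

For the Figure 1 example, $|x| = 5$ and $|\hat{x}| = 4$, so recall averages 5 maxima and precision averages 4.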